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Abstract 

We present an algorithm for the exhaustive enumeration of all monomer sequences 
and conformations of short lattice proteins as described by the hydrophobic-polar 
(HP) model. The algorithm is used for an exact identification of all designing se- 
quences of HP proteins consisting of up to 19 monomers whose conformations are 
represented by interacting self-avoiding walks on the simple cubic lattice. Employ- 
ing a parallelized implementation on a Linux cluster, we generate the complete set 
of contact maps of such walks. 

Key words: 

lattice proteins, HP model, contact maps, self-avoiding walks, polymers 
PACS: 05.10.-a, 87.15.Aa, 87.15.Cc 



1 Introduction 

The numerical treatment of protein models is highly nontrivial. On one hand, 
the design of realistic models suffers from the fact that the atomic interactions 
among the constituents of proteins and with their aqueous cellular environ- 
ment are by no means well understood [1]. On the other hand, the computa- 
tional effort increases drastically with the length of the molecules. Therefore, 
significant simplifications of the realistic situation have to be introduced in 
order to facilitate a detailed analysis based on computational methods and, in 
particular, to allow studies of the relation between sequence and conformation 
spaces of model proteins. 
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Herein, we will consider two versions of the very simple HP lattice model [2,3] 
which makes the following assumptions: Instead of considering all 20 different 
kinds of amino acids that occur in real proteins the model comprises only 
two prototypes of residues: hydrophilic (or polar, P) and hydrophobic (H) 
monomers, respectively. This is to account for the fact that most of the nat- 
urally occurring amino acids can be classified in that way (Ref. [1], p. 154). 
Also the atomar interactions are drastically simplified. Short-range repulsion 
between monomers is taken into account by modeling the conformations of 
HP proteins as self-avoiding walks on regular lattices. The simple cubic (sc) 
lattice was used in this study. In addition one considers in the most simple 
formulation of the model exclusively a nearest-neighbor attractive interaction 
between hydrophobic residues non-adjacent in the polymer chain [2]. Slightly 
more involved variants also take into account nearest-neighbor contacts be- 
tween HP and/or PP pairs [3]. This is an effective way of describing the 
interaction of the molecule with the aqueous environment [4]. 

Exact enumeration results obtained for short HP proteins can be used as 
cross-checks for other non-exact methods that search the conformational and 
sequence spaces of proteins. These include Monte Carlo and genetic algorithms 
(e.g. Refs. [5,6]), generalized ensemble techniques (e.g. Ref. [7]), chain growth 
algorithms (e.g. Refs. [8,9]), and combinations thereof (e.g. Refs. [10,11]). More 
importantly, the complete treatment of all sequences and conformations allows 
one to carry out systematic statistical analyses of HP proteins. Our results for 
the sc lattice described in more detail in Ref. [12] complement prior exact 
enumeration studies on the square lattice [13] and for HP proteins with con- 
formations restricted to regular cuboids on the sc lattice [14,15]. 

In the next section we introduce the HP models used here in a little more 
formal way. Section 3 explains the exact enumeration procedure in terms of 
which our results are obtained. The concept of exact enumeration is first illus- 
trated with a naive implementation. What remains of Section 3 is dedicated 
to improvements of that simple implementation and describes how these im- 
provements apply to a simple example case. In Section 4 we show how our 
exact results can be applied for a comparison of the numbers of self-avoiding 
walks and contact matrices and for the determination of designing sequences 
in the HP model. Section 5 concludes this article with a summary and an out- 
look on further statistical analyses based on the results of the enumerations 
presented here. 



2 HP Models 

An HP protein is defined by its sequence of monomers. We will denote the 
type of monomer by <7j, where % — 1, . . . , N is the position of the monomer in 
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a polymer chain of length N and by convention <7j G {0—P, 1=H}. Its con- 
formation, which is a self-avoiding walk on the lattice (with lattice spacing 
a = 1), is represented by an ordered collection of lattice vectors that con- 
tain the positions of the residues: X = (x 1; x 2 , . . . , xjv). The distance between 
monomers i and j is denoted by Xij = |xj — x 3 -|. The attractive interaction 
between pairs of residues is short-ranged on the underlying lattice. It is con- 
sidered only between residues that are on nearest-neighbor positions but not 
covalently bound in the molecular chain. Such a pair of residues is said to be 
in contact. This is expressed by the following energy function that is assigned 
to each HP protein: 

E= E CyU^, (1) 
i,j>i+l 

where C,j = (1 — f° r x ij — 1> an d zero otherwise, is a symmetric N x N 

matrix called contact map and 



is the 2x2 interaction matrix. 



u H h u HP ^ 



UHP Upp J 



(2) 



In the present study, the HP model comes in two versions that are different 
from one another in the way attractive interactions between the amino acids 
are considered. In the original version of the model [2] , which we will refer to 
as HP model in the following, only a pair of hydrophobic residues in contact 
contributes to the energy function (1) and the only non-zero entry in the 
interaction matrix is u^f H = — 1. A modification of the model [3] also takes 
into account an interaction between hydrophobic and polar monomers. We call 
it the MHP (mixed HP) model. Its interaction matrix entries read = 
-1, u^f = -1/2.3 « -0.435, and uf^ p = 0. The magnitude of u^f is 
motivated by an analysis of inter-residue contact energies between different 
types of real amino acids [4] . 

A sequence of monomers is called a designing sequence if there exists exactly 
one conformation (up to trivial symmetries, to be explained in more detail 
below) for its state of lowest energy. The interest in designing sequences is 
based on a generally accepted biochemical principle that sequence specifies 
conformation, and, in turn, the conformation of a polymer determines its bi- 
ological function. Accepting this principle also in the framework of the highly 
simplified HP model leads directly to the concept of designing sequences. The 
conformation of the lowest-energy state is uniquely determined for designing 
sequences only. Furthermore, the number of designing sequences is very small 
compared to the total number of 2^ HP sequences of a given chain length N. 
Thus, the ability of identifying designing sequences may be seen as a bench- 
mark for algorithms that search the sets of conformations and sequences of 
HP proteins. 
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3 Exact Enumeration 

3.1 Naive implementation 

A straightforward method of identifying designing sequences of a given length 
is to perform an exact enumeration. This means to run through the whole set 
of sequences and for each sequence through the whole set of conformations. 
Consider such a deliberately naive enumeration for short sequences of length 
N = 4 in the HP model. Trivially, there are 2 4 = 16 different sequences and 
6x5x5 = 150 self-avoiding walks on the sc lattice. 

Up to symmetries, Fig. 1 shows all conformations for N = 4. Only the con- 
formation designated by FLL has a contact between its first and last residues. 
Consequently, all four HP sequences with a hydrophobic monomer in the first 
and last positions of the sequence must be designing: there is only one confor- 
mation for the lowest energy E = — 1. All other sequences are non-designing 
since the energy E = is obviously degenerate. 

When increasing the number of monomers N by one, the number of sequences 
doubles. Also, the number of self-avoiding walks (SAW) is known to increase 
asymptotically by a factor of /xsaw ~ 4.684 [16,17,18]. Consequently, the com- 
putational effort scales roughly as 9.37^, i.e., exponentially fast with the chain 
length N. This is why improvements of the naive enumeration become neces- 
sary even for rather short chain lengths. 

3.2 Improvements 

Some of these improvements are very obvious. Firstly, there is no point in car- 
rying out the enumeration for two sequences that contain the same monomers 
but in reverse order. For example, the sequences HPPP and PPPH are equiva- 
lent. Simply counting the number of relevant sequences, Rn, that have to be 
considered in the enumeration yields 



3.2.1 Symmetries on the sc lattice 

Furthermore, not all self-avoiding walks are independent but related to each 
other by symmetry operations. On the sc lattice, there are six directions for 
the first bond of any conformation. Fixing the direction of the first bond, we 
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Fig. 1. All relevant conformations for N = 4 together with their configurational 
chain codes explained in the text. 



are left with (1/6) / S (X) mutually symmetric conformations, where / S (X) is 
the total number of mutually symmetric self-avoiding walks starting from the 
origin. Figure 2 illustrates these symmetries for the conformation encoded by 
FLU in Fig. 1. For a given conformation X, the symmetry factor is given by 



6 if X linear, 
/ s (X) = \ 24 if X planar, 
48 otherwise. 



(4) 



We represent conformations by means of chain codes that encode the steps 
of self-avoiding walks in terms of a sequence of relative moves. On the sc 
lattice there are five kinds of such moves which we denote by F ("forward"), 
L ("left"), R ("right"), U ("up"), and D ("down"). The chain codes for all 
independent conformations consisting of four monomers are shown in Fig. 1. 
Two vectors are needed in order to define the five moves on the sc lattice: 
Let Oj be a unit vector attached to the monomer at Xj and s$ another unit 
vector at Xj perpendicular to Oj determining the direction of the i th step of the 
self-avoiding walk as shown in Fig. 3. Given a chain code, we determine the 
conformation X by initially choosing Xi = (0, 0, 0), Oi = (0, 0, 1), S\ = (0, 1, 0) 
and by specifying how to go over from {x^ Oj, s^} to {x i+1 , o i+1 , s i+1 } for each 



(a) 



A 



(b) 



(f) 



(g) 



V 



(d) 



00 



Fig. 2. All (1/6) / S (X) = 8 conformations are symmetric on the sc lattice and 
their first bonds show into the same direction. Symmetry operations applied to the 
conformation (a) are rotations about the first bond (b,c,d), reflections at a lattice 
plane (e,f), and two compositions of rotations and reflections (g,h). 
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Fig. 3. Encoding a conformation in terms of relative moves on the lattice. Vectors 
designated by F, L, R, U, D are Sj + i(F), Sj+i(L), etc. 

of the five moves: 




(5) 



(6) 



(7) 



Equations (5), (6), and (7) can be read off from Fig. 3. 



In exact enumeration, it is not desirable to enumerate conformations that 
are symmetric to each other. This is easily achieved by enumerating the chain 
codes for conformations of a given length N considering only codes that satisfy 
a chosen set of rules. The choice of x 1; o l5 and Si determines the first move 
which we call by convention F. Furthermore, we require the first move that 
makes the walk deviate from a linear conformation to be an L-move and, 
subsequently, we require the first step into the third coordinate direction to 
be a U-move. For conformations of length N = A modeled as self-avoiding 
walks of three steps there are six different chain codes obeying these rules and 
coding for not mutually symmetric conformations (see again Fig. 1). 



3.2.2 Contact maps 

As defined in (1), the energy of an HP protein does not depend explicitly 
on its conformation but only on the information of which pairs of monomers 
form contacts. This information is contained in the contact map. In general, 
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Fig. 4. Two different conformations which have a single contact between their first 
and fourth monomers. Both belong to the same contact map. 

more than one conformation corresponds to a given contact map (see Fig. 4). 
Therefore, it is possible to improve exact enumeration in terms of contact 
maps: First, all self-avoiding walks of a given length are enumerated once in 
order to generate the complete set of contact maps. In a second step, designing 
sequences are identified by running through the set of contact maps for each 
sequence. A sequence is identified as designing if there is exactly one contact 
map that corresponds to the lowest energy and, in turn, if there is exactly one 
self-avoiding walk corresponding to that contact map. 

The first step of this enumeration procedure requires all contact maps to be 
stored in memory. Some straightforward properties of contact maps can be 
used in order to occupy as little memory as possible. Apart from symmetry 
(Cij = Cji) and the trivial facts that self-contacts are not meaningful [Ca = 
0) and that, by definition, covalently bound monomers are not counted as 
contacts (C^ — if \i — j\ — 1), these properties are: 



which are easily seen to be consequences of the considered sc lattice geometry. 
Figure 5 illustrates these properties for the N = 8 case. There are Z 8 = 9 
possible contacts for conformations of that length, i.e., nine bits are required 
to store any such contact map in memory. In general, this number can be 
calculated to be 



For each contact map C, we also accumulate the number g c (C) of self-avoiding 
walks corresponding to that contact map. In the determination of g c {C) the 
trivial symmetries described above are automatically excluded. We include 
them in a separate quantity g s (C) given by X)/ S (X), where the sum runs 
over all g c (C) conformations that correspond to C and / S (X) is given by (4). 
Knowledge of g s (C) is necessary for the calculation of thermodynamic quanti- 
ties and we store it for each contact map, too. Furthermore, we retain for each 
contact map C the last chain code that we enumerate and whose conformation 



= if \i — j\ — 2n, n = 1,2, ... , 



(8) 






(10) 
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Fig. 5. Properties of contact maps of conformations consisting of N = 8 monomers. 
Contacts that correspond to entries that are crossed out cannot be formed by con- 
formations on the sc lattice. There are nine pairs of residues that can possibly be 
in contact (indexed fields). 

corresponds to C . In particular, this yields all conformations corresponding to 
contact matrices with g c {C) = 1 which allows to determine the ground-state 
conformations of designing sequences. 



3.2.3 Parallelization 

The number of contact maps that can be simultaneously held in memory was 
increased by distributing them over several individual processors (IPs). We 
implemented the corresponding program according to the Message Passing 
Interface (MPI) standard [19] and executed it on the local Linux cluster Ha- 
grid 1 , consisting of 40 Athlon 1800+ MHz processors with 100 Mbit Ethernet 
communication. Figure 6 shows the basic structure of the program. The master 
process Pi generates all self-avoiding walks X and the corresponding contact 
maps C (X) . For each self-avoiding walk, it can behave in three different ways: 

(1) If the memory associated to Pi is not yet filled up and C(X) was not 
stored in this memory partition before, C (X) will be appended to the list 
of contact maps stored in Pi. 

(2) If C(X) is already stored in P 1 the corresponding counters g c (C) and 
g s {C) will be increased by one and / S (X), respectively. 

(3) If C(X) is not stored in P x and its memory is completely filled, C(X) 

1 http : //www . physik . uni-leipzig . de/Computer/Hagrid 
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Fig. 6. Generating contact maps in terms of parallel distributed programming. Two 
individual processors (IPs), Pi and P2, are shown. Each IP but Pi disposes of an 
input buffer Ij, and there is an output buffer 0, for all IPs but the last one. Buffering 
allows for faster information communication as less MPI messages have to be sent 
between the IPs. 



will be stored in the master's output buffer Oi. When the output buffer 
is filled up all contact maps in the buffer will be transferred to the next 
IP's input buffer I 2 . 



This way of storing contact maps is termed selective insertion in Fig. 6. The 
behavior of the slave processes P 2 , P3, ... is very similar. The difference is that 
their input buffers serve as sources of contact maps, they do not perform any 
kind of enumeration. Their mere purpose is lookup and storage of contact 
maps. 



The second step of the enumeration, i.e., going through all contact maps for 
all sequences, can be trivially parallelized in order to reduce the running time. 
We achieved this by simply distributing the set of contact maps over all IPs. 
Then, each IP performs the enumeration with respect to its subset of contact 
maps. Finally, all IPs send their enumeration results to the master process 
which compares the lowest energies that were found by the slaves in order 
to find the "globally" minimal energies and the correct degeneracies g c and 
g s . The speed-up factor due to this parallelization is virtually equal to the 
number of available processors. 
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3.3 A simple example 



In the following, we illustrate briefly how the improvements discussed above 
apply to the very simple N = 4 example. There are i? 4 = 10 relevant sequences 
which we store in the array 

S = [PPPP, PPPH, PPHP, PPHH, PHPH, PHHP, PHHH, HPPH, HPHH, HHHH] . (11) 

Also, there are six relevant conformations as shown in Fig. 1. In the first 
enumeration step, all contact maps are determined. Five conformations (FFF, 
FFL, FLF, FLR, and FLU) have no contacts and belong to the trivial empty 
contact map which we will call C^°\ The conformation encoded by FLL has 
a single contact between its first and fourth residues; we refer to its contact 
map as Thus, there are only two contact maps with g c (C^) = 5 and 

ffc(C (1) ) = 1, and we compute g s (C^) = / S (FFF)+/ S (FFL)+/ S (FLF)+/ S (FLR) + 
/ S (FLU) = 126 and g s (C^) = /.(FLL) = 24. 

In the second step, we run through all sequences from (11) for each of the two 
contact maps. The enumeration requires four more arrays of length i? 4 = 10 in 
order to store the lowest energy, E, the accumulated degeneracies, G c and G s , 
and an example conformation for each sequence, W. After evaluation of (1) 
for all sequences with respect to these arrays read 

E = [0,0,0,0,0,0,0,0,0,0], (12) 

G c = [5,5,5,5,5,5,5,5,5,5], (13) 

G s = [126, 126, 126, 126, 126, 126, 126, 126, 126, 126], (14) 

W = [FFF, FFF, FFF, FFF, FFF, FFF, FFF, FFF, FFF, FFF]. (15) 

Calculating now the energies with respect to yields once more E = 
for the first seven sequences and a lower energy E = — 1 for the last three 
sequences in (11). The arrays have to be updated accordingly. This means that 
for the sequences with energy E = the counter for the conformations, G c , is 
incremented, while it is reset for the other three sequences, since their energies 
are lower now. The degeneracy G s which includes the symmetry factors for 
the conformations is accumulated appropriately and the new conformations 
possessing lower energies than those in the previous step (15) are stored in 
W: 



E= [0,0, 0,0, 0,0, 0,-1, -1,-1], (16) 

G c = [6, 6, 6, 6, 6, 6, 6, 1,1,1], (17) 

G s = [150, 150, 150, 150, 150, 150, 150, 24, 24, 24], (18) 

W = [FFF, FFF, FFF, FFF, FFF, FFF, FFF, FLL, FLL, FLL]. (19) 



This shows that there are three designing sequences HPPH, HPHH, and HHHH 
corresponding to the entries equal to unity in (17), and from the corresponding 
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entries in (19) we read off that, in this example, all three sequences possess the 
same unique ground-state conformation FLL with energy E = —1, as stored 
in (16). 

Of course, parallelization must seem quite artificial in this very simple exam- 
ple. It would correspond to distributing the two contact maps and 
over two different IPs. 



4 Applications 

Table 1 and the corresponding Fig. 7 show how the number of self-avoiding 
walks, Cat, grows in comparison to the number of contact maps, M N , for 
chain lengths iV < 19. For a given chain length, there are many more self- 
avoiding walks than contact maps. The ratio between both numbers shown 
in the rightmost column of Table 1 keeps growing with n = N — 1. The 
exponential growth, as suggested by Fig. 7, can generically be described by 
the following scaling ansatz [20,21]: 

C n = Aii n n<-\ (20) 

where 7 is a universal exponent and /i the effective coordination number. 
For self-avoiding walks we reproduce the well-known results /isaw ~ 4.684 

Table 1 

Number of self-avoiding conformations CV and contact maps Mjy on a sc lattice. 
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n = N - 1 
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\C N /M N 
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and 7 ~ 1.16 [16,17,18] by means of a ratio method analysis (see, e.g., 
Refs. [20,22,23]). Assuming a scaling form (20) also for the number of contact 
maps (CM), a similar analysis yields //cm ~ 4.38, i.e., their (still) exponen- 
tial growth is slower than that of self-avoiding walks (see Ref. [12] for more 
details). 

In the second enumeration step we determined all designing sequences of 
length iV < 19, their numbers are shown in Table 2. As the interaction is 
more complicated in the MHP case, it is intuitively clear that degeneracies 
are lifted and that there are hence more designing sequences for that model 
than in the HP case. We also note that there are fewer designing sequences in 
the HP model on the sc lattice than for the same model and the same lengths 
N on the square lattice [13]. 



Table 2 

Numbers of designing sequences Sn (only relevant sequences, see text) in the HP 
and MHP models. 
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5 Summary and Outlook 

In the first part of our exact enumeration procedure we generated the com- 
plete sets of contact maps for self-avoiding walks of n < 18 steps, i.e. for 
conformations of up to N = 19 monomers. We parallelized our program such 
as to distribute the set of contact maps over several memory partitions of a 
Linux cluster. In the second step of enumeration, we determined all designing 
sequences for both types of interactions considered. Here, parallelization is 
used to decrease the required computer time. 

The results obtained this way can be used in a statistical analysis of design- 
ing sequences and their ground-state conformations. First, in the space of 
sequences, we can discuss the hydrophobicity, i.e., the H-content of design- 
ing sequences as well as hydrophobicity profiles describing the distribution of 
hydrophobic monomers in the polymer chain. Additionally, it enables us to in- 
vestigate in how far monomers are involved in the formation of HH contacts 
(and HP contacts in case of the MHP model) by defining hydrophobic con- 
tact density profiles. Second, in the space of conformations, the data obtained 
herein allow for the study of the end-to-end distances and radii of gyration 
as measures of the compactness of designed conformations. The considera- 
tion of the distribution of the designability of designed conformations shows 
that some conformations are preferred over others as ground-state conforma- 
tions of designing sequences. The complete statistical analysis can be found 
in Ref. [12]. 

Finally, it should be pointed out that a slight variation of the enumeration 
procedure explained herein also allows for the exact determination of the den- 
sity of states g(E), i.e., the number of conformations corresponding to all 
energy levels and not just to the ground-state energy. This number includes 
all symmetries which is why we store the degeneracies g s (C) for all contact 
maps C (see Section 3.2.2). For a given HP sequence, g{E) can be used to 
determine the temperature dependence of energetic quantities, in particular 
that of the specific heat whose peaks can be associated with conformational 
transitions [12]. 
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